
Bayesian Analysis

Michael Belias

April 20, 2017

Introduction

According to the Oxford dictionary, statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. Data may come from any applied field, such as medicine, finance, the social sciences, or physics, and can be separated into two types: quantitative and qualitative.

A statistical analysis typically proceeds in seven steps:

  1. Define the problem

  2. Data collection and manipulation

  3. Explore the data

  4. Using the above three steps, decide on the model to be used

  5. Fit the model

  6. Check the model and refine it if necessary

  7. Finalize the model and draw inferences

The above steps are not strictly distinct; in some cases they overlap, and further steps may be nested within them. The same principles apply in the Bayesian framework as well.

In this tutorial we will cover common probability distributions, Bayesian theory, Monte Carlo estimation, Markov chains, and Markov chain Monte Carlo algorithms.

Common probability distributions

Discrete distributions

Uniform

The uniform distribution is the simplest discrete probability distribution. It assigns equal probability to N different outcomes, usually represented by the numbers 1, 2, …, N.

X ~ Uniform(N)

\(P(X = x|N) = 1/N\) for x = 1,2,…N

\(E[X]= \frac{N+1}{2}\)

\(Var[X]= \frac{N^2-1}{12}\)

One common example is the outcome of throwing a fair six-sided die where N=6.
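The die example can be checked numerically. Below is a minimal sketch (in Python; the course's own code is in R) that evaluates the formulas above for N = 6 and verifies them against a direct enumeration of the outcomes:

```python
# Discrete uniform on {1, ..., N}: checking the pmf, mean, and variance for N = 6 (a fair die).
N = 6
pmf = 1 / N                    # P(X = x | N) = 1/N for each outcome
mean = (N + 1) / 2             # E[X] = (N + 1)/2 = 3.5
var = (N ** 2 - 1) / 12        # Var[X] = (N^2 - 1)/12

# Verify against a direct enumeration of the N equally likely outcomes.
outcomes = list(range(1, N + 1))
mean_direct = sum(outcomes) / N
var_direct = sum((x - mean_direct) ** 2 for x in outcomes) / N
print(mean, mean_direct)           # both 3.5
print(round(var, 4), round(var_direct, 4))
```

The enumeration agrees with the closed-form expressions, as it must for any N.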

Bernoulli

The Bernoulli distribution is used for binary outcomes, such as 0 and 1. It has one parameter p, the probability of “success”, i.e., of getting a 1 (or whichever value we designate as success). X ~ Bern(p)

\(P(X = x | p) = p^x(1-p)^{1-x}\) for x = 0, 1

E[X] = p

Var[X] = p(1-p)

One common example is the outcome of flipping a fair coin (p = 0.5)

Binomial

The binomial distribution counts the number of “successes” in n independent Bernoulli trials (each with the same probability of success). Thus if \(X_1 , X_2 ,..., X_n\) are independent Bernoulli(p) random variables, then Y = \(\sum^n_{i=1} X_i\) is binomial distributed.

Y ~Binom(n, p)

P(Y= y|n,p) = \(\binom{n}{y} {p^y}(1-p)^{(n-y)}\) , for y = 0,1, …, n

E[Y] =np

Var[Y]= np(1-p)

where \(\binom{n}{y}= \frac{n!}{y!(n-y)!}\) .
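As a sketch (in Python; the course's own code is in R), we can evaluate the binomial pmf for illustrative values n = 10 and p = 0.5 and confirm the mean and variance formulas:

```python
from math import comb

# Binomial pmf for n = 10 trials with success probability p = 0.5 (illustrative values).
n, p = 10, 0.5
pmf = {y: comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)}

total = sum(pmf.values())                                    # probabilities sum to 1
mean = sum(y * prob for y, prob in pmf.items())              # E[Y] = np = 5
var = sum((y - mean)**2 * prob for y, prob in pmf.items())   # Var[Y] = np(1-p) = 2.5
print(round(total, 10), round(mean, 10), round(var, 10))
```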

Poisson

The Poisson distribution is used for counts, and arises in a variety of situations. The parameter \(\lambda\) > 0 is the rate at which we expect to observe the thing we are counting.

X~Pois(\(\lambda\))

\(P(X= x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\) , for x = 0, 1, 2, …

E[X] = \(\lambda\)

Var[X] = \(\lambda\)

A Poisson process is a process wherein events occur on average at rate \(\lambda\), events occur one at a time, and events occur independently of each other.

Example:

Significant earthquakes occur in the Western United States approximately following a Poisson process with a rate of two earthquakes per week. What is the probability there will be at least 3 earthquakes in the next two weeks? Answer: the rate per two weeks is 2*2 = 4, so let X ~ Pois(4); we want \(P(X \geq 3)=1-P(X\leq 2) =1- P(X = 0) - P( X = 1 ) - P( X = 2) =1 - e^{-4} - 4e^{-4} -\frac{4^2 e^{-4}}{2} = 1 - 13e^{-4} \approx 0.762\)

Note that 0! = 1 by definition.
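The earthquake calculation above can be reproduced directly from the Poisson pmf (a Python sketch; the course's own code is in R):

```python
from math import exp, factorial

# X ~ Pois(4): probability of at least 3 earthquakes in two weeks.
lam = 4.0
p_le_2 = sum(lam**x * exp(-lam) / factorial(x) for x in range(3))  # P(X <= 2)
p_ge_3 = 1 - p_le_2                                                # complement
print(round(p_ge_3, 3))   # 0.762
```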

Geometric

The geometric distribution is the number of failures before obtaining the first success, i.e., the number of Bernoulli failures until a success is observed, such as the number of tails before the first head when flipping a coin. It takes values on the non-negative integers starting with 0 (alternatively, we could count total trials until the first success, in which case we would start with 1).

X~Geo(p)

\(P(X=x|p) = p(1-p)^x\) , for x = 0, 1, 2, …

\(E[X] = \frac{1-p}{p}\)

\(Var[X] = \frac{1-p}{p^2}\)

If the probability of getting a success is p, then the expected number of trials until the first success is 1/p and the expected number of failures until the first success is (1 - p)/p.
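A quick numerical check of E[X] = (1 − p)/p under the failures-before-success pmf, truncating the infinite sum at a large x (a Python sketch with an illustrative p):

```python
# Geometric (failures before first success) with p = 0.25 (illustrative).
# E[X] should equal (1 - p)/p = 3; we approximate the infinite sum by truncation.
p = 0.25
mean = sum(x * p * (1 - p)**x for x in range(10_000))
print(round(mean, 6))   # 3.0
```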

Negative Binomial

The negative binomial distribution extends the geometric distribution to model the number of failures before achieving the rth success. It takes values on the non-negative integers starting with 0.

Y~NegBinom(r,p)

P(Y = y|r,p) = \(\binom{r + y - 1}{y} {p^r}(1-p)^{y}\) for y = 0, 1, 2, …

E[Y] = \(\frac{r(1-p)}{p}\)

Var[Y] = \(\frac{r(1-p)}{p^2}\)

Note that the geometric distribution is a special case of the negative binomial distribution with r = 1. Because 0 < p < 1, we have E[Y] < Var[Y]. This makes the negative binomial a popular alternative to the Poisson when modeling counts with high variance (recall that the mean equals the variance for Poisson-distributed variables).
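The mean and variance formulas, and the fact that E[Y] < Var[Y], can be verified numerically by summing the pmf over a long truncated range (a Python sketch with illustrative values r = 3, p = 0.4):

```python
from math import comb

# Negative binomial: failures before the r-th success, with r = 3 and p = 0.4 (illustrative).
r, p = 3, 0.4
def pmf(y):
    return comb(r + y - 1, y) * p**r * (1 - p)**y

ys = range(2000)                               # truncation; the tail is negligible here
mean = sum(y * pmf(y) for y in ys)             # should equal r(1-p)/p = 4.5
var = sum((y - mean)**2 * pmf(y) for y in ys)  # should equal r(1-p)/p^2 = 11.25
print(round(mean, 4), round(var, 4))           # the mean is smaller than the variance
```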

Multinomial

Another generalization of the Bernoulli and the binomial is the multinomial distribution, which behaves like a binomial when there are more than two possible outcomes. Suppose we have n trials and there are k different possible outcomes, which occur with probabilities \(p_1,p_2,..., p_k\). For example, suppose we are rolling a six-sided die that might be loaded so that the sides are not equally likely. Then n is the total number of rolls, k = 6, \(p_1\) is the probability of rolling a one, and we denote by \(x_1,x_2, ...,x_6\) a possible outcome for the number of times we observe each of the faces one through six, where \(\sum_{i=1}^{6}x_i = n\) and \(\sum_{i=1}^{6}p_i = 1\). The pmf is \(P(x_1, \ldots, x_k \mid n, p_1, \ldots, p_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}\).
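As a sketch (Python; all numbers hypothetical), we can evaluate the multinomial probability \(\frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}\) for one possible outcome of n = 12 rolls of a loaded die:

```python
from math import factorial

# Multinomial probability of one outcome of n rolls of a hypothetical loaded die.
n = 12
p = [0.3, 0.2, 0.1, 0.1, 0.1, 0.2]   # hypothetical face probabilities; sum to 1
x = [4, 3, 1, 1, 1, 2]               # one possible outcome; counts sum to n

coef = factorial(n)                  # multinomial coefficient n! / (x_1! ... x_k!)
for xi in x:
    coef //= factorial(xi)

prob = float(coef)
for xi, pi in zip(x, p):
    prob *= pi ** xi
print(prob)
```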

Bayesian Theory

History

Bayesian statistics is based on the homonymous Bayes’ theorem (or rule), invented by Thomas Bayes, a British reverend, in the 1740s. His primary field of study was theology, but Bayes was also an “amateur” mathematician. During his studies in Edinburgh he was influenced by the philosopher David Hume, who proposed that we can only rely on what we learn from experience. Probability as a mathematical field was only just emerging in those days, able to solve simple problems such as the probability of observing an effect given a cause, but not the inverse, P(cause | effect). Bayes gave a simple example of tossing balls onto a table and recording where they stop (on the left or the right side of the table), noting that the more balls are thrown, the better we may infer whether the ball-tossing is biased toward one side. This is nowadays called a learning process, and although it was a remarkable finding, Bayes left it forgotten in a drawer (!) until his death. Richard Price found it and, after studying his papers for two years and making some corrections, finally published “An Essay towards solving a Problem in the Doctrine of Chances” in 1763.

Still, at that point the theorem was just an example and did not have its final modern form. Even after this publication, no one really continued its development except Laplace, who, while trying to solve an astronomical problem, studied Price’s paper and developed a first version of what we now call Bayes’ theorem. The reception of Laplace’s proposal was somewhat hostile due to its inherent challenges, such as the assumption of equal prior probabilities, its subjectivity, and the serious technical and computational problems it posed in practice, which remain a great issue even today.

Bayes theorem

Bayes’ theorem gives the probability of an event given prior knowledge of conditions that might be related to the event. It is stated mathematically as the following equation (Efron, 2013):

\({P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}}\)


where A and B are events and P(B) \(\neq\) 0.

This is the basis of Bayesian inference, a particular approach to statistical inference with its own interpretation of probability. When applied, the probabilities involved in Bayes’ theorem may carry different probability interpretations. Under the Bayesian interpretation, the theorem expresses how a subjective degree of belief should rationally change to account for the availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.
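A worked arithmetic sketch of the theorem, with entirely hypothetical numbers (a classic diagnostic-test setup: A = “has condition”, B = “tests positive”):

```python
# Bayes' theorem with hypothetical numbers: P(A|B) = P(B|A) P(A) / P(B).
p_A = 0.01                 # prior P(A) (assumed)
p_B_given_A = 0.95         # P(B | A), test sensitivity (assumed)
p_B_given_notA = 0.05      # false-positive rate P(B | not A) (assumed)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))   # about 0.161
```

Even with a sensitive test, the posterior probability stays modest because the prior P(A) is small, which is exactly the kind of belief-updating the theorem formalizes.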

Monte Carlo estimation

Simulation and CLT

Before we learn how to simulate from complicated posterior distributions, let’s review some of the basics of Monte Carlo estimation. Monte Carlo estimation refers to simulating hypothetical draws from a probability distribution in order to calculate important quantities. These quantities might include the mean, the variance, the probability of some event, or quantiles of the distribution. All of these calculations involve integration, which except for the simplest distributions, can be very difficult or impossible.

Suppose we have a random variable \(\theta\) that follows a Gamma(a,b) distribution. Let’s say a = 2 and b = 1/3, where b is the rate parameter. To calculate the mean of this distribution, we would need to compute the following integral:

\[\text{E}(\theta) = \int_0^\infty \theta f(\theta) d\theta = \int_0^\infty \theta \frac{b^a}{\Gamma(a)}\theta^{a-1}e^{-b\theta} d\theta \,\]

It is possible to compute this integral, and the answer is a/b (6 in this case). However, we could verify this answer through Monte Carlo estimation. To do so, we would simulate a large number of draws (call them \(\theta_i^*\) for i = 1,2,…,m) from this gamma distribution and calculate their sample mean. Why can we do this? Recall from the previous course that if we have a random sample from a distribution, the average of those samples converges in probability to the true mean of that distribution by the Law of Large Numbers. Furthermore, by the Central Limit Theorem (CLT), this sample mean \(\bar \theta^* = \frac{1}{m} \sum_{i=1}^{m} \theta_i^*\) approximately follows a normal distribution with mean E(\(\theta\)) and variance Var(\(\theta\))/m. The theoretical variance of \(\theta\) is the following integral: \[\text{Var}(\theta) = \int_0^\infty (\theta-E(\theta))^2 f(\theta) d\theta \,\]

Just like we did with the mean, we could approximate this variance with the sample variance \(\frac{1}{m}\sum_{i=1}^m (\theta_i^* - \bar{\theta}^*)^2\)
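The Gamma(2, rate 1/3) example above can be carried out as a Monte Carlo sketch (Python; the course's own code is in R). Note that Python's `random.gammavariate` is parameterized by the scale, which is the reciprocal of the rate:

```python
import random

# Monte Carlo estimation for theta ~ Gamma(a = 2, rate b = 1/3):
# true mean a/b = 6, true variance a/b^2 = 18.
random.seed(42)
a, b = 2.0, 1.0 / 3.0
m = 100_000
draws = [random.gammavariate(a, 1.0 / b) for _ in range(m)]  # gammavariate takes scale = 1/rate

mc_mean = sum(draws) / m
mc_var = sum((t - mc_mean) ** 2 for t in draws) / m
print(round(mc_mean, 2), round(mc_var, 1))   # close to 6 and 18
```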

Calculating probabilities

This method can be used to calculate many different integrals. Say h(\(\theta\)) is any function and we want to calculate \(\int h(\theta) p(\theta) d\theta\). This integral is precisely what is meant by E(h(\(\theta\))), so we can conveniently approximate it by taking the sample mean of h(\(\theta_i^*\)). That is, we apply the function h to each simulated sample from the distribution and take the average of all the results.

One extremely useful example of an h function is the indicator \(I_A(\theta)\), where A is some logical condition about the value of \(\theta\). To demonstrate, suppose \(h(\theta) = I_{[\theta<5]}(\theta)\), which gives 1 if \(\theta<5\) and 0 otherwise.

Then E(h(\(\theta\))) = \(\int_0^\infty I_{[\theta<5]}(\theta) p(\theta) d\theta = \int_0^5 1 \cdot p(\theta) d\theta + \int_5^\infty 0 \cdot p(\theta) d\theta = P(\theta < 5) \,\). This means we can approximate the probability that \(\theta < 5\) by drawing many samples \(\theta_i^*\) and approximating this integral with \(\frac{1}{m} \sum_{i=1}^m I_{[\theta_i^* < 5]}(\theta_i^*)\). This expression simply counts how many of those samples come out to be less than 5 and divides by the total number of simulated samples. So simple!

Likewise, we can approximate quantiles of a distribution. If we are looking for the value z such that P(\(\theta\) < z) = 0.9, we simply arrange the samples \(\theta_i^*\) in ascending order and find the smallest drawn value that is greater than 90% of the others.
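Both tricks, the indicator average and the sorted-sample quantile, can be sketched with the same Gamma(2, rate 1/3) example (Python; the course's own code is in R):

```python
import random

# Monte Carlo probability and quantile for theta ~ Gamma(2, rate 1/3).
random.seed(0)
m = 100_000
draws = sorted(random.gammavariate(2.0, 3.0) for _ in range(m))  # scale = 1/rate = 3

# P(theta < 5): the fraction of draws below 5 (the indicator-function average).
p_less_5 = sum(1 for t in draws if t < 5) / m

# 0.9 quantile: the smallest draw greater than 90% of the others.
z = draws[int(0.9 * m)]
print(round(p_less_5, 2), round(z, 1))
```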

Monte Carlo error

How good is an approximation by Monte Carlo sampling? Again we can turn to the CLT, which tells us that the variance of our estimate is controlled in part by m . For a better estimate, we want larger m .

For example, if we seek E(\(\theta\)), then the sample mean \(\bar \theta^*\) approximately follows a normal distribution with mean E(\(\theta\)) and variance Var(\(\theta\))/m. The variance tells us how far our estimate might be from the true value. One way to approximate Var(\(\theta\)) is to replace it with the sample variance. The standard deviation of our Monte Carlo estimate is the square root of that, or the sample standard deviation divided by \(\sqrt m\). If m is large, it is reasonable to assume that the true value will likely be within about two standard deviations of your Monte Carlo estimate.
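Continuing the same example, the Monte Carlo standard error described above is just the sample standard deviation divided by \(\sqrt m\) (a Python sketch):

```python
import random, math

# Monte Carlo standard error for the mean of theta ~ Gamma(2, rate 1/3).
random.seed(1)
m = 10_000
draws = [random.gammavariate(2.0, 3.0) for _ in range(m)]   # scale = 1/rate = 3

mc_mean = sum(draws) / m
samp_var = sum((t - mc_mean) ** 2 for t in draws) / m
se = math.sqrt(samp_var / m)    # standard error of the Monte Carlo mean estimate
print(round(mc_mean, 2), round(se, 3))
# The true mean (6) should usually lie within about 2*se of mc_mean.
```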

Marginalization

We can also obtain Monte Carlo samples from hierarchical models. As a simple example, let’s consider a binomial random variable where \(y \mid \phi \sim \text{Bin}(10, \phi)\), and further suppose \(\phi\) is random (as if it had a prior) and is distributed beta, \(\phi \sim \text{Beta}(2,2)\). Given any hierarchical model, we can always write the joint distribution of y and \(\phi\) as \(p(y, \phi) = p(y \mid \phi)p(\phi)\) using the chain rule of probability. To simulate from this joint distribution, repeat these two steps for a large number m of iterations: first draw \(\phi_i^*\) from Beta(2,2), then, given \(\phi_i^*\), draw \(y_i^*\) from Bin(10, \(\phi_i^*\)).

This will produce m independent pairs of \((y^*, \phi^*)_i\) drawn from their joint distribution. One major advantage of Monte Carlo simulation is that marginalizing is easy. Calculating the marginal distribution of y, \(p(y) = \int_0^1 p(y, \phi) d\phi\), might be challenging analytically. But if we have draws from the joint distribution, we can just discard the \(\phi_i^*\) draws and use the \(y_i^*\) as samples from their marginal distribution. This is also called the prior predictive distribution, introduced in the previous course.
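The hierarchical simulation can be sketched as follows (Python; the course's own code is in R). Draws of \(y^*\) obtained this way are samples from the marginal (prior predictive) distribution of y:

```python
import random

# Hierarchical simulation: phi* ~ Beta(2, 2), then y* | phi* ~ Binomial(10, phi*).
random.seed(7)
m = 100_000
ys = []
for _ in range(m):
    phi = random.betavariate(2.0, 2.0)
    y = sum(1 for _ in range(10) if random.random() < phi)  # Binomial(10, phi) draw
    ys.append(y)                                            # keep y*, discard phi*

# Marginally, E[y] = 10 * E[phi] = 10 * 0.5 = 5.
print(round(sum(ys) / m, 2))
```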

In the next segment, we will demonstrate some of these principles. Remember, we do not yet know how to sample from the complicated posterior distributions introduced in the previous lesson. But once we learn that, we will be able to use the principles from this lesson to make approximate inferences from those posterior distributions.

Markov chains

Definition: If we have a sequence of random variables \(X_1,X_2,...,X_n\), where the indices 1, 2, …, n represent successive points in time, we can use the chain rule of probability to calculate the probability of the entire sequence:

\(p(X_1, X_2, \ldots, X_n) = p(X_1) \cdot p(X_2 \mid X_1) \cdot p(X_3 \mid X_2, X_1) \cdot \ldots \cdot p(X_n \mid X_{n-1}, \ldots, X_2, X_1) \,\)

Markov chains simplify this expression by using the Markov assumption. The assumption is that given the entire past history, the probability distribution for the random variable at the next time step only depends on the current variable. Mathematically, the assumption is written like this:

\(p(X_{t+1} | X_t, X_{t-1}, \ldots, X_2, X_1 ) = p(X_{t+1} | X_t) \,\)

for all t=2,…,n. Under this assumption, we can write the first expression as \(p(X_1, X_2, \ldots X_n) = p(X_1) \cdot p(X_2 | X_1) \cdot p(X_3 | X_2) \cdot p(X_4 | X_3) \cdot \ldots \cdot p(X_n | X_{n-1}) \, ,\)

which is much simpler than the original. It consists of an initial distribution for the first variable, \(P(X_1)\), and n − 1 transition probabilities. We usually make one more assumption: that the transition probabilities do not change with time. Hence, the transition from time t to time t + 1 depends only on the value of \(X_t\).

Examples of Markov chains

Discrete Markov chain

Suppose you have a secret number (make it an integer) between 1 and 5. We will call it your initial number at step 1. Now for each time step, your secret number will change according to the following rules:

  1. Flip a coin.
  • If the coin turns up heads, then increase your secret number by one (5 wraps around to 1).
  • If the coin turns up tails, then decrease your secret number by one (1 wraps around to 5).
  2. Repeat n times, and record the evolving history of your secret number.

Before the experiment, we can think of the sequence of secret numbers as a sequence of random variables, each taking on a value in {1,2,3,4,5}. Assume that the coin is fair, so that with each flip, the probability of heads and tails are both 0.5.

Does this game qualify as a true Markov chain? Suppose your secret number is currently 4 and that the history of your secret numbers is (2,1,2,3). What is the probability that on the next step, your secret number will be 5? What about the other four possibilities? Because of the rules of this game, the probability of the next transition will depend only on the fact that your current number is 4. The numbers further back in your history are irrelevant, so this is a Markov chain.

This is an example of a discrete Markov chain, where the possible values of the random variables come from a discrete set. Those possible values (secret numbers in this example) are called states of the chain. The states are usually numbers, as in this example, but they can represent anything. In one common example, the states describe the weather on a particular day, which could be labeled as 1-fair, 2-poor.
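The secret-number game can be simulated directly (a Python sketch; the course's own code is in R). The starting state and number of steps below are illustrative:

```python
import random

# Simulate the secret-number Markov chain on states {1, 2, 3, 4, 5} with wrap-around moves.
random.seed(3)
n_steps = 20
x = 3                       # illustrative initial secret number
history = [x]
for _ in range(n_steps):
    if random.random() < 0.5:        # heads: increase by one (5 wraps to 1)
        x = x + 1 if x < 5 else 1
    else:                            # tails: decrease by one (1 wraps to 5)
        x = x - 1 if x > 1 else 5
    history.append(x)
print(history)
```

Each update uses only the current state, never the earlier history, which is exactly the Markov property discussed above.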

Random walk (continuous)

Now let’s look at a continuous example of a Markov chain. Say \(X_0 = 0\) and we have the following transition model: \(p(X_{t+1} | X_t=x_t) = \text{N}(x_t, 1)\). That is, the probability distribution for the next state is Normal with variance 1 and mean equal to the current state. This is often referred to as a “random walk.” Clearly, it is a Markov chain because the transition to the next state \(X_{t+1}\) only depends on the current state \(X_t\).

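A minimal simulation of this random walk (a Python sketch; the original chunk was R, and the chain length here is illustrative):

```python
import random

# Random walk: X_0 = 0 and X_{t+1} | X_t = x_t ~ N(x_t, 1).
random.seed(5)
n = 100
x = 0.0
chain = [x]
for _ in range(n):
    x = random.gauss(x, 1.0)   # next state: Normal(current state, sd 1)
    chain.append(x)
print(len(chain), round(chain[-1], 2))
```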

Transition matrix

Let’s return to our example of the discrete Markov chain. If we assume that transition probabilities do not change with time, then there are a total of \(5^2 = 25\) potential transition probabilities. Potential transition probabilities would be from State 1 to State 2, State 1 to State 3, and so forth. These transition probabilities can be arranged into a matrix Q:

\[Q = \begin{pmatrix} 0 & .5 & 0 & 0 & .5 \\ .5 & 0 & .5 & 0 & 0 \\ 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & .5 & 0 & .5 \\ .5 & 0 & 0 & .5 & 0 \\ \end{pmatrix}\]

where the transitions from State 1 are in the first row, the transitions from State 2 are in the second row, etc. For example, the probability \(p(X_{t+1} = 5 \mid X_t = 4)\) can be found in the fourth row, fifth column.

The transition matrix is especially useful if we want to find the probabilities associated with multiple steps of the chain. For example, we might want to know \(p(X_{t+2}=3 \mid X_t=1)\) , the probability of your secret number being 3 two steps from now, given that your number is currently 1. We can calculate this as \(\sum_{k=1}^5 p(X_{t+2}=3 \mid X_{t+1}=k) \cdot p(X_{t+1}=k \mid X_t=1)\) , which conveniently is found in the first row and third column of \(Q^2\).


     [,1] [,2] [,3] [,4] [,5]
[1,] 0.50 0.00 0.25 0.25 0.00
[2,] 0.00 0.50 0.00 0.25 0.25
[3,] 0.25 0.00 0.50 0.00 0.25
[4,] 0.25 0.25 0.00 0.50 0.00
[5,] 0.00 0.25 0.25 0.00 0.50
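The two-step matrix above can be reproduced with a plain matrix multiplication (a Python sketch; the original chunk was R):

```python
# Two-step transition probabilities: Q squared.
Q = [
    [0.0, 0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5, 0.0],
]

def matmul(A, B):
    """Multiply two 5x5 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(5)) for j in range(5)]
            for i in range(5)]

Q2 = matmul(Q, Q)
# p(X_{t+2} = 3 | X_t = 1) is in the first row, third column.
print(Q2[0][2])   # 0.25
```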

Stationary distribution

Suppose we want to know the probability distribution of your secret number in the distant future, say \(p(X_{t+h} | X_t)\), where h is a large number. Let’s calculate this for a few different values of h.

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 0.062 0.312 0.156 0.156 0.312
[2,] 0.312 0.062 0.312 0.156 0.156
[3,] 0.156 0.312 0.062 0.312 0.156
[4,] 0.156 0.156 0.312 0.062 0.312
[5,] 0.312 0.156 0.156 0.312 0.062
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 0.248 0.161 0.215 0.215 0.161
[2,] 0.161 0.248 0.161 0.215 0.215
[3,] 0.215 0.161 0.248 0.161 0.215
[4,] 0.215 0.215 0.161 0.248 0.161
[5,] 0.161 0.215 0.215 0.161 0.248
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 0.201 0.199 0.200 0.200 0.199
[2,] 0.199 0.201 0.199 0.200 0.200
[3,] 0.200 0.199 0.201 0.199 0.200
[4,] 0.200 0.200 0.199 0.201 0.199
[5,] 0.199 0.200 0.200 0.199 0.201

Notice that as the future horizon gets more distant, the transition distributions appear to converge. The state you are currently in becomes less important in determining the more distant future. If we let h get really large, and take it to the limit, all the rows of the long-range transition matrix become equal to (.2, .2, .2, .2, .2). That is, if you run the Markov chain for a very long time, the probability that you will end up in any particular state is 1/5 = .2 for each of the five states. These long-range probabilities are equal to what is called the stationary distribution of the Markov chain.

The stationary distribution of a chain is the initial state distribution for which performing a transition will not change the probability of ending up in any given state. That is,

     [,1] [,2] [,3] [,4] [,5]
[1,]  0.2  0.2  0.2  0.2  0.2

One consequence of this property is that once a chain reaches its stationary distribution, the stationary distribution will remain the distribution of the states thereafter.

We can also demonstrate the stationary distribution by simulating a long chain from this example.

Now that we have simulated the chain, let’s look at the distribution of visits to the five states.

x
     1      2      3      4      5 
0.1996 0.2020 0.1980 0.1994 0.2010 

The overall distribution of the visits to the states is approximately equal to the stationary distribution.

As we have just seen, if you simulate a Markov chain for many iterations, the samples can be used as a Monte Carlo sample from the stationary distribution. This is exactly how we are going to use Markov chains for Bayesian inference. In order to simulate from a complicated posterior distribution, we will set up and run a Markov chain whose stationary distribution is the posterior distribution.
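The long-run simulation described above can be sketched as follows (Python; the original chunk was R). The visit frequencies should come out close to the stationary distribution of 0.2 per state:

```python
import random
from collections import Counter

# Long run of the secret-number chain; compare visit frequencies to the
# stationary distribution (0.2, 0.2, 0.2, 0.2, 0.2).
random.seed(11)
n = 100_000
x = 3
visits = Counter()
for _ in range(n):
    if random.random() < 0.5:
        x = x + 1 if x < 5 else 1   # heads: up one, wrapping 5 -> 1
    else:
        x = x - 1 if x > 1 else 5   # tails: down one, wrapping 1 -> 5
    visits[x] += 1

freqs = {s: visits[s] / n for s in sorted(visits)}
print(freqs)   # each frequency should be near 0.2
```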

It is important to note that a stationary distribution does not exist for every Markov chain. The chain must have certain properties, which we won’t discuss here. However, the Markov chain algorithms we’ll use in future lessons for Monte Carlo estimation are guaranteed to produce stationary distributions.

Continuous example

The continuous random walk example we gave earlier does not have a stationary distribution. However, we can modify it so that it does have a stationary distribution.

Let the transition distribution be \(p(X_{t+1} | X_t=x_t) = \text{N}(\phi x_t, 1)\), where −1 < \(\phi\) < 1. That is, the probability distribution for the next state is Normal with variance 1 and mean equal to \(\phi\) times the current state. As long as \(\phi\) is between −1 and 1, the stationary distribution exists for this model.

Let’s simulate this chain for \(\phi = -0.6\).

The theoretical stationary distribution for this chain is normal with mean 0 and variance 1/(1 − \(\phi^2\)), which in our example equals 1/(1 − 0.36) = 1.5625. Let’s look at a histogram of our chain and compare it with the theoretical stationary distribution.

It appears that the chain has reached the stationary distribution. Therefore, we could treat this simulation from the chain like a Monte Carlo sample from the stationary distribution, a normal with mean 0 and variance 1.5625.
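The chain with \(\phi = -0.6\) can be simulated and checked against the theoretical stationary moments (a Python sketch; the original chunk was R):

```python
import random

# AR(1)-style chain: X_{t+1} | X_t ~ N(phi * X_t, 1) with phi = -0.6.
# Stationary distribution: N(0, 1/(1 - phi^2)) = N(0, 1.5625).
random.seed(13)
phi = -0.6
n = 100_000
x = 0.0
chain = []
for _ in range(n):
    x = random.gauss(phi * x, 1.0)
    chain.append(x)

mean = sum(chain) / n
var = sum((v - mean) ** 2 for v in chain) / n
print(round(mean, 2), round(var, 2))   # close to 0 and 1.5625
```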

Because most posterior distributions we will look at are continuous, our Monte Carlo simulations with Markov chains will be similar to this example.

Monte Carlo Example

Metropolis-Hastings

Metropolis-Hastings is an algorithm that allows us to sample from a generic probability distribution (which we will call the target distribution), even if we do not know the normalizing constant. To do this, we construct and sample from a Markov chain whose stationary distribution is the target distribution. It consists of picking an arbitrary starting value, and iteratively accepting or rejecting candidate samples drawn from another distribution, one that is easy to sample.

Let’s say we wish to produce samples from a target distribution \(p(\theta) \propto g(\theta)\) where we don’t know the normalizing constant (since \(\int g(\theta) d\theta\) is hard or impossible to compute), so we only have \(g(\theta)\) to work with. The Metropolis-Hastings algorithm proceeds as follows.

  1. Select an initial value \(\theta_0\) .
  2. For i = 1, …, m, repeat the following steps:
  • (a) Draw a candidate \(\theta^* \sim q(\theta^* \mid \theta_{i-1})\) from a proposal distribution q.
  • (b) Compute the acceptance ratio \(\alpha = \frac{g(\theta^*) \, q(\theta_{i-1} \mid \theta^*)}{g(\theta_{i-1}) \, q(\theta^* \mid \theta_{i-1})}\).
  • (c) If \(\alpha \ge 1\), then set \(\theta_i = \theta^*\). If \(\alpha < 1\), then set \(\theta_i = \theta^*\) with probability \(\alpha\), or \(\theta_i = \theta_{i-1}\) with probability 1 − \(\alpha\).

Steps 2b and 2c act as a correction since the proposal distribution is not the target distribution. At each step in the chain, we draw a candidate and decide whether to “move” the chain there or remain where we are. If the proposed move to the candidate is “advantageous,” (\(\alpha \ge 1\)) we “move” there and if it is not “advantageous,” we still might move there, but only with probability \(\alpha\). Since our decision to “move” to the candidate only depends on where the chain currently is, this is a Markov chain.
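The algorithm can be sketched in a few lines (Python; the course's own code is in R). Here the target is an illustrative unnormalized density \(g(\theta) = \exp(-\theta^2/2)\) (a standard normal up to its constant), and the proposal is a symmetric random walk, so q cancels from the ratio:

```python
import random, math

# Random-walk Metropolis-Hastings on the unnormalized target g(theta) = exp(-theta^2/2).
random.seed(17)

def log_g(theta):
    return -theta ** 2 / 2.0

m = 50_000
theta = 0.0                      # step 1: arbitrary initial value
samples = []
for _ in range(m):               # step 2: repeat m times
    cand = random.gauss(theta, 1.0)              # (a) draw a candidate
    log_alpha = log_g(cand) - log_g(theta)       # (b) acceptance ratio on the log scale
    if math.log(random.random()) < log_alpha:    # (c) accept with probability min(1, alpha)
        theta = cand
    samples.append(theta)

mean = sum(samples) / m
print(round(mean, 2))            # close to 0, the target's mean
```

Working with log g, as here, avoids numerical underflow; the same trick is used later for the posterior example.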

Proposal distribution

One careful choice we must make is the candidate-generating distribution \(q(\theta^* \mid \theta_{i-1})\). It may or may not depend on the previous iteration’s value of \(\theta\). One example where it doesn’t depend on the previous value is when \(q(\theta)\) is always the same distribution. If we use this option, \(q(\theta)\) should be as similar as possible to \(p(\theta)\).

Another popular option, one that does depend on the previous iteration, is Random-Walk Metropolis-Hastings. Here, the proposal distribution is centered on \(\theta_{i-1}\). For instance, it might be a normal distribution with mean \(\theta_{i-1}\). Because the normal distribution is symmetric, this example comes with another advantage: \(q(\theta^* \mid \theta_{i-1}) = q(\theta_{i-1} \mid \theta^*)\), causing it to cancel out when we calculate \(\alpha\). Thus, in Random-Walk Metropolis-Hastings where the candidate is drawn from a normal with mean \(\theta_{i-1}\) and constant variance, the acceptance ratio is \(\alpha = \frac{g(\theta^*) }{g(\theta_{i-1})}\).

Acceptance rate

Clearly, not all candidate draws are accepted, so our Markov chain sometimes “stays” where it is, possibly for many iterations. How often you want the chain to accept candidates depends on the type of algorithm you use. If you approximate \(p(\theta)\) with \(q(\theta^*)\) and always draw candidates from that, accepting candidates often is good; it means \(q(\theta^*)\) is approximating \(p(\theta)\) well. However, you may still want q to have a larger variance than p and to see some rejection of candidates as an assurance that q is covering the space well.

As we will see in coming examples, a high acceptance rate for the Random-Walk Metropolis-Hastings sampler is not a good thing. If the random walk is taking steps that are too small, it will accept candidates often but will take a very long time to fully explore the posterior. If the random walk is taking steps that are too large, many of its proposals will have low probability, and the acceptance rate will be low, wasting many draws. Ideally, a random-walk sampler should accept somewhere between 23% and 50% of the candidates proposed.

In the next segment, we will see a demonstration of this algorithm used in a discrete case, where we can show mathematically that the Markov chain converges to the target distribution. In the following segment, we will demonstrate coding a Random-Walk Metropolis-Hastings algorithm in R to solve one of the problems from the end of Lesson 2.

Random walk with normal likelihood, t prior

Recall the model from the last segment of Lesson 2 where the data are the percent change in total personnel from last year to this year for n=10 companies. We used a normal likelihood with known variance and t distribution for the prior on the unknown mean. Suppose the values are y=(1.2,1.4,-0.5,0.3,0.9,2.3,1.0,0.1,1.3,1.9) . Because this model is not conjugate, the posterior distribution is not in a standard form that we can easily sample. To obtain posterior samples, we will set up a Markov chain whose stationary distribution is this posterior distribution.

Recall that the posterior distribution is:

\[p(\mu \mid y_1, \ldots, y_n) \propto \frac{\exp[ n ( \bar{y} \mu - \mu^2/2)]}{1 + \mu^2}\] The posterior distribution on the left is our target distribution and the expression on the right is our g(\(\mu\)).

The first thing we can do in R is write a function to evaluate g(\(\mu\)). Because posterior distributions include likelihoods (the product of many numbers that are potentially small), g(\(\mu\)) might evaluate to such a small number that, to the computer, it is effectively zero. This will cause a problem when we evaluate the acceptance ratio \(\alpha\). To avoid this problem, we can work on the log scale, which is more numerically stable. Thus, we will write a function to evaluate

\[\log(g(\mu)) = n ( \bar{y} \mu - \mu^2/2) - \log(1 + \mu^2)\]

This function will require three arguments: \(\mu\), \(\bar{y}\), and n.

Next, let’s write a function to execute the Random-Walk Metropolis-Hastings sampler with normal proposals.

Now, let’s set up the problem.

Finally, we’re ready to run the sampler! Let’s use m = 1000 iterations, a proposal standard deviation (which controls the proposal step size) of 3.0, and an initial value at the prior median, 0.

List of 2
 $ mu   : num [1:1000] -0.113 1.507 1.507 1.507 1.507 ...
 $ accpt: num 0.122
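The sampler just described can be sketched as follows (Python; the course's own code is in R, so names like `lg` and `mh` mirror that code but are not identical to it). Exact draws and acceptance rates will differ from the R output above because they depend on the random seed:

```python
import random, math

# Random-walk Metropolis-Hastings for the normal-likelihood / t-prior example:
# log g(mu) = n*(ybar*mu - mu^2/2) - log(1 + mu^2).
random.seed(19)
y = [1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9]
n, ybar = len(y), sum(y) / len(y)

def lg(mu, ybar, n):
    return n * (ybar * mu - mu ** 2 / 2.0) - math.log(1.0 + mu ** 2)

def mh(n_iter, mu_init, cand_sd):
    mu = mu_init
    out, n_accept = [], 0
    for _ in range(n_iter):
        cand = random.gauss(mu, cand_sd)                  # propose a candidate
        log_alpha = lg(cand, ybar, n) - lg(mu, ybar, n)   # log acceptance ratio
        if math.log(random.random()) < log_alpha:         # accept / reject
            mu, n_accept = cand, n_accept + 1
        out.append(mu)
    return out, n_accept / n_iter

post, accpt = mh(n_iter=1000, mu_init=0.0, cand_sd=3.0)
print(round(accpt, 3))   # a low acceptance rate, as in the output above
```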

A plot of the sampled values against the iteration number is called a trace plot. It shows the history of the chain and provides basic feedback about whether the chain has reached its stationary distribution.

It appears our proposal step size was too large (acceptance rate below 23%). Let’s try another.

[1] 0.946

[1] 0.38

Hey, that looks pretty good. Just for fun, let’s see what happens if we initialize the chain at some far-off value.

[1] 0.387

It took a while to find the stationary distribution, but it looks like we succeeded! If we discard the first 100 or so values, it appears that the rest of the samples come from the stationary distribution, our posterior distribution! Let’s plot the posterior density against the prior to see how the data updated our belief about \(\mu\).

These results are encouraging, but they are preliminary. We still need to investigate more formally whether our Markov chain has converged to the stationary distribution. We will explore this in a future lesson.

Obtaining posterior samples using the Metropolis-Hastings algorithm can be time-consuming and require some fine-tuning, as we’ve just seen. The good news is that we can rely on software to do most of the work for us. In the next couple of videos, we’ll introduce a program that will make posterior sampling easy.

Gibbs sampling

So far, we have demonstrated MCMC for a single parameter. What if we seek the posterior distribution of multiple parameters, and that posterior distribution does not have a standard form? One option is to perform Metropolis-Hastings (M-H) by sampling candidates for all parameters at once, and accepting or rejecting all of those candidates together. While this is possible, it can get complicated. Another (simpler) option is to sample the parameters one at a time.

As a simple example, suppose we have a joint posterior distribution for two parameters \(\theta\) and \(\phi\), written \(p(\theta,\phi \mid y) \propto g(\theta,\phi)\). If we knew the value of \(\phi\), then we would just draw a candidate for \(\theta\) and use \(g(\theta,\phi)\) to compute our Metropolis-Hastings ratio, and possibly accept the candidate. If we don’t know \(\phi\), then before moving on to the next iteration we can perform a similar update for it: draw a candidate for \(\phi\) using some proposal distribution and again use \(g(\theta,\phi)\) to compute our Metropolis-Hastings ratio, this time pretending we know the value of \(\theta\) by substituting its current iteration from the Markov chain. Once we’ve drawn for both \(\theta\) and \(\phi\), that completes one iteration, and we begin the next iteration by drawing a new \(\theta\). In other words, we’re just going back and forth, updating the parameters one at a time and plugging the current value of the other parameter into \(g(\theta,\phi)\).

This idea of one-at-a-time updates is used in what we call Gibbs sampling, which also produces a stationary Markov chain (whose stationary distribution is the posterior).

Full conditional distributions

Before describing the full Gibbs sampling algorithm, there’s one more simplification we can make. Using the chain rule of probability, we have \(p(\theta, \phi \mid y) = p(\theta \mid \phi, y) \cdot p(\phi \mid y)\). Notice that the only difference between \(p(\theta, \phi \mid y)\) and \(p(\theta \mid \phi, y)\) is multiplication by a factor that doesn’t involve \(\theta\). Since the \(g(\theta,\phi)\) function above, when viewed as a function of \(\theta\), is proportional to both of these expressions, we might as well have replaced it with \(p(\theta \mid \phi, y)\) in our update for \(\theta\). This distribution \(p(\theta \mid \phi, y)\) is called the full conditional distribution for \(\theta\). Why use it instead of \(g(\theta,\phi)\)? In some cases, the full conditional distribution is a standard distribution that we know how to sample directly. If that happens, we no longer need to draw a candidate and decide whether to accept it. In fact, if we treat the full conditional distribution as the candidate proposal distribution, the resulting Metropolis-Hastings acceptance probability is exactly 1.

Gibbs samplers require a little more work up front because you need to find the full conditional distribution for each parameter. The good news is that all full conditional distributions have the same starting point: the full joint posterior distribution. Using the example above, we have \(p(\theta \mid \phi, y) \propto p(\theta, \phi \mid y)\), where we now simply treat \(\phi\) as a known number. Likewise, the other full conditional is \(p(\phi \mid \theta, y) \propto p(\theta, \phi \mid y)\), where we consider \(\theta\) to be a known number. We always start with the full posterior distribution. Thus, the process of finding full conditional distributions is the same as finding the posterior distribution of each parameter, pretending that all other parameters are known.

Gibbs sampler

The idea of Gibbs sampling is that we can update multiple parameters by sampling just one parameter at a time, cycling through all parameters and repeating. To perform the update for one particular parameter, we substitute in the current values of all other parameters.

Here is the algorithm. Suppose we have a joint posterior distribution for two parameters \(\theta\) and \(\phi\), written \(p(\theta,\phi \mid y)\). If we can find the full conditional distribution of each parameter, i.e., \(p(\theta \mid \phi, y)\) and \(p(\phi \mid \theta, y)\), then we can take turns sampling from these distributions like so:

1. Using \(\phi_{i-1}\), draw \(\theta_i\) from \(p(\theta \mid \phi = \phi_{i-1}, y)\).

2. Using \(\theta_i\), draw \(\phi_i\) from \(p(\phi \mid \theta = \theta_i, y)\).

Together, steps 1 and 2 complete one cycle of the Gibbs sampler and produce the draw \((\theta_i, \phi_i)\) in one iteration of an MCMC sampler. If there are more than two parameters, the same scheme applies: one Gibbs cycle includes an update for each of the parameters.

Example

Normal likelihood, unknown mean and variance

Let’s work through an example with a normal likelihood where both the mean and the variance are unknown. The model is: \[ \begin{aligned} y_i \mid \mu, \sigma^2 &\overset{\text{iid}}{\sim} \text{N} ( \mu, \sigma^2 ), \quad i=1,\ldots,n \\ \mu &\sim \text{N}(\mu_0, \sigma_0^2) \\ \sigma^2 &\sim \text{IG}(\nu_0, \beta_0) \end{aligned} \]

We chose a normal prior for \(\mu\) because, in the case where \(\sigma^2\) is known, the normal is the conjugate prior for \(\mu\). Likewise, in the case where \(\mu\) is known, the inverse-gamma is the conjugate prior for \(\sigma^2\). This will give us convenient full conditional distributions in a Gibbs sampler.

Let’s first work out the form of the full posterior distribution. When we begin analyzing data, the JAGS software will complete this step for us. However, it is extremely valuable to see and understand how this works.

\[ \begin{aligned} \\ p( \mu, \sigma^2 \mid y_1, y_2, \ldots, y_n ) &\propto p(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) p(\mu) p(\sigma^2) = \prod_{i=1}^n \text{N} ( y_i \mid \mu, \sigma^2 ) \times \text{N}( \mu \mid \mu_0, \sigma_0^2) \times \text{IG}(\sigma^2 \mid \nu_0, \beta_0) \\ = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \times \frac{\beta_0^{\nu_0}}{\Gamma(\nu_0)}(\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ \propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \end{aligned} \]

From here, it is easy to continue on to find the two full conditional distributions we need. First let’s look at \(\mu\), assuming \(\sigma^2\) is known (in which case it becomes a constant and is absorbed into the normalizing constant): \[\begin{aligned} p(\mu \mid \sigma^2, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \\ &= \exp \left[ -\frac{1}{2} \left( \frac{ \sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right) \right] \\ &\propto \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \, , \end{aligned}\] which we derived in the supplementary material of the last course. So, given the data and \(\sigma^2\), \(\mu\) follows this normal distribution.

Now let’s look at \(\sigma^2\), assuming \(\mu\) is known: \[\begin{aligned} p(\sigma^2 \mid \mu, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-(\nu_0 + n/2 + 1)} \exp \left[ -\frac{1}{\sigma^2} \left( \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \, . \end{aligned}\]

These two distributions provide the basis of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is the full posterior of both \(\mu\) and \(\sigma^2\). We simply alternate draws between these two parameters, using the most recent draw of one parameter to update the other.

We will do this in R in the next segment.

Gibbs sampler in R

To implement the Gibbs sampler we just described, let’s return to our running example where the data are the percent change in total personnel from last year to this year for n=10 companies. We’ll still use a normal likelihood, but now we’ll relax the assumption that we know the variance of growth between companies, \(\sigma^2\), and estimate that variance instead. In place of the t prior from earlier, we will use the conditionally conjugate priors: normal for \(\mu\) and inverse-gamma for \(\sigma^2\).

The first step is to write functions that simulate from the full conditional distributions we derived in the previous segment. The full conditional for \(\mu\), given \(\sigma^2\) and the data, is:

\(\text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right)\)
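A minimal R sketch of a function that draws from this full conditional (the function and argument names are our own):

```r
update_mu = function(n, ybar, sig2, mu_0, sig2_0) {
  sig2_1 = 1.0 / (n / sig2 + 1.0 / sig2_0)           # posterior variance
  mu_1 = sig2_1 * (n * ybar / sig2 + mu_0 / sig2_0)  # posterior mean
  rnorm(n = 1, mean = mu_1, sd = sqrt(sig2_1))
}
```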

The full conditional for \(\sigma^2\) given \(\mu\) and data is:

\(\text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right)\)
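A matching sketch for this draw; because R's rgamma simulates gamma variables, we draw a gamma and invert it to obtain an inverse-gamma draw:

```r
update_sig2 = function(n, y, mu, nu_0, beta_0) {
  nu_1 = nu_0 + n / 2.0
  ss = sum((y - mu)^2)        # sum of squared deviations from mu
  beta_1 = beta_0 + ss / 2.0
  out_gamma = rgamma(n = 1, shape = nu_1, rate = beta_1)
  1.0 / out_gamma             # reciprocal of a gamma draw is an inverse-gamma draw
}
```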

With functions for drawing from the full conditionals, we are ready to write a function to perform Gibbs sampling.
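One way to sketch that function, alternating the two updates defined above:

```r
gibbs = function(y, n_iter, init, prior) {
  ybar = mean(y)
  n = length(y)
  mu_out = numeric(n_iter)
  sig2_out = numeric(n_iter)
  mu_now = init$mu
  # one Gibbs cycle: update sig2 given the current mu, then mu given the new sig2
  for (i in 1:n_iter) {
    sig2_now = update_sig2(n = n, y = y, mu = mu_now,
                           nu_0 = prior$nu_0, beta_0 = prior$beta_0)
    mu_now = update_mu(n = n, ybar = ybar, sig2 = sig2_now,
                       mu_0 = prior$mu_0, sig2_0 = prior$sig2_0)
    sig2_out[i] = sig2_now
    mu_out[i] = mu_now
  }
  cbind(mu = mu_out, sig2 = sig2_out)
}
```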

Now we are ready to set up the problem in \(\text{R}\).
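For instance (a sketch: the data vector below is a hypothetical stand-in for the ten percent-change values, and the prior hyperparameters are illustrative choices, so exact draws will differ from the output shown):

```r
y = c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9)  # hypothetical data

prior = list(mu_0 = 0.0, sig2_0 = 1.0,  # normal prior on mu
             nu_0 = 1.0, beta_0 = 1.0)  # inverse-gamma prior on sig2
init = list(mu = 0.0)

set.seed(53)
post = gibbs(y = y, n_iter = 1000, init = init, prior = prior)
head(post)

library("coda")
summary(as.mcmc(post))
```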

            mu      sig2
[1,] 0.3746992 1.5179144
[2,] 0.4900277 0.8532821
[3,] 0.2536817 1.4325174
[4,] 1.1378504 1.2337821
[5,] 1.0016641 0.8409815
[6,] 1.1576873 0.7926196


Iterations = 1:1000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 1000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

       Mean     SD Naive SE Time-series SE
mu   0.9051 0.2868  0.00907        0.00907
sig2 0.9282 0.5177  0.01637        0.01810

2. Quantiles for each variable:

       2.5%    25%    50%   75% 97.5%
mu   0.3024 0.7244 0.9089 1.090 1.481
sig2 0.3577 0.6084 0.8188 1.094 2.141

As with the Metropolis-Hastings example, these chains appear to have converged. In the next lesson, we will discuss convergence in more detail.

Linear regression: a second example

New York City Crime Data

As an example of a Bayesian linear regression model, we look at New York City crime data from 1966 to 1967. The outcome variable (THEFT) is the increase or decrease in the seasonally adjusted rate of grand larcenies in 23 Manhattan police precincts from a 27-week pre-intervention period compared to a 58-week intervention period. The predictor variables are the percent increase or decrease in the number of police officers in a precinct (MAN), and whether the precinct was located uptown, midtown or downtown.

We specify the model as:

\[THEFT_i \sim N(\mu_i, \sigma^2) \\ \mu_i = \beta_0 + \beta_1 \cdot MAN_i + district_i \\ \frac{1}{\sigma^2} \sim \Gamma(0.001, 0.001) \\ \beta_0 \sim N(0, 100000) \\ \beta_1 \sim N(0, 100000)\]

We will need a prior for the effect of geographic area (district), but will discuss that in a moment. First, we consider how best to code an indicator variable in BUGS. There are two possible approaches. We can create a “design matrix” of dummy variables. In this approach we create two variables named DIST1 and DIST2, for which downtown precincts are coded 0,0, midtown precincts are coded 1,0, and uptown precincts are coded 0,1. The BUGS code for this model would then be:
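A sketch of what that BUGS model could look like (node names follow the text; the diffuse priors on the district coefficients are placeholders until we settle on a prior for the district effect):

```
model {
  for (i in 1:23) {
    THEFT[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta1*MAN[i] + beta2*DIST1[i] + beta3*DIST2[i]
  }
  beta0 ~ dnorm(0, 0.00001)   # variance 100000 written as a precision
  beta1 ~ dnorm(0, 0.00001)
  beta2 ~ dnorm(0, 0.00001)   # placeholder district priors
  beta3 ~ dnorm(0, 0.00001)
  tau ~ dgamma(0.001, 0.001)  # prior on 1/sigma^2
}
```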

Logistic regression

For an example of logistic regression, we’ll use the urine data set from the boot package in R. The response variable is r, which takes on values of 0 or 1. We will remove some rows from the data set which contain missing values.
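Loading and cleaning the data might look like this (a sketch; urine ships with the boot package):

```r
library("boot")
data("urine")
head(urine)
dat = na.omit(urine)  # drop rows with missing values
```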

  r gravity   ph osmo cond urea calc
1 0   1.021 4.91  725   NA  443 2.45
2 0   1.017 5.74  577 20.0  296 4.49
3 0   1.008 7.20  321 14.9  101 2.36
4 0   1.011 5.51  408 12.6  224 2.15
5 0   1.005 6.52  187  7.5   91 1.16
6 0   1.020 5.27  668 25.3  252 3.34

Let’s look at pairwise scatter plots of the seven variables.
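For example (a one-line sketch):

```r
pairs(dat)  # pairwise scatter plots of all seven variables
```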

One thing that stands out is that several of these variables are strongly correlated with one another. For example, gravity and osmo appear to have a very close linear relationship. Collinearity between predictor variables in linear regression models can cause trouble for statistical inference: two correlated variables will compete for the ability to predict the response, leading to unstable estimates. This is not a problem for prediction of the response, if prediction is the end goal of the model. But if our objective is to discover how the variables relate to the response, we should avoid collinearity.

We can more formally estimate the correlation among these variables using the corrplot package.
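A sketch using corrplot (the display options are a matter of taste):

```r
library("corrplot")
Cor = cor(dat)
corrplot(Cor, type = "upper", method = "ellipse", tl.pos = "d")
```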

Variable selection

One primary goal of this analysis is to find out which variables are related to the presence of calcium oxalate crystals. This objective is often called “variable selection.” We have already seen one way to do this: fit several models that include different sets of variables and see which one has the best DIC. Another way to do this is to use a linear model where the priors for the \(\beta\) coefficients favor values near 0 (indicating a weak relationship). This way, the burden of establishing association lies with the data. If there is not a strong signal, we assume it doesn’t exist.

Rather than tailoring a prior for each individual \(\beta\) based on the scale its covariate takes values on, it is customary to subtract the mean and divide by the standard deviation for each variable.
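In R, the scale function does this in one step (X is our name for the standardized design matrix):

```r
X = scale(dat[ , -1], center = TRUE, scale = TRUE)  # drop the response r, standardize the rest
head(X[ , "gravity"])  # standardized values of one covariate
colMeans(X)            # each column now has mean (essentially) zero
apply(X, 2, sd)        # and standard deviation one
```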

         2          3          4          5          6          7 
-0.1403037 -1.3710690 -0.9608139 -1.7813240  0.2699514 -0.8240622 
      gravity            ph          osmo          cond          urea 
-9.861143e-15  8.511409e-17  1.515743e-16 -1.829852e-16  7.335402e-17 
         calc 
-1.689666e-18 
gravity      ph    osmo    cond    urea    calc 
      1       1       1       1       1       1 

Model

Our prior for the \(\beta\) coefficients (called b in the model code) will be the double exponential (or Laplace) distribution which, as the name implies, looks like the exponential distribution with tails extending in the negative as well as the positive direction, and with a sharp peak at 0. You can read more about it in the JAGS manual.

     gravity         ph       osmo       cond        urea        calc
2 -0.1403037 -0.4163725 -0.1528785 -0.1130908  0.25747827  0.09997564
3 -1.3710690  1.6055972 -1.2218894 -0.7502609 -1.23693077 -0.54608444
4 -0.9608139 -0.7349020 -0.8585927 -1.0376121 -0.29430353 -0.60978050
5 -1.7813240  0.6638579 -1.7814497 -1.6747822 -1.31356713 -0.91006194
6  0.2699514 -1.0672806  0.2271214  0.5490664 -0.07972172 -0.24883614
7 -0.8240622 -0.5825618 -0.6372741 -0.4379226 -0.51654898 -0.83726644
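Putting this together, a sketch of the JAGS model (the prior scales are our assumptions; a ddexp(0, sqrt(2)) prior has variance 1):

```r
library("rjags")

mod1_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*ph[i] + b[3]*osmo[i] +
                      b[4]*cond[i] + b[5]*urea[i] + b[6]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:6) {
        b[j] ~ ddexp(0.0, sqrt(2.0))  # double exponential (Laplace) with variance 1
    }
} "

data_jags = list(y = dat$r, gravity = X[ , "gravity"], ph = X[ , "ph"],
                 osmo = X[ , "osmo"], cond = X[ , "cond"],
                 urea = X[ , "urea"], calc = X[ , "calc"])
mod1 = jags.model(textConnection(mod1_string), data = data_jags, n.chains = 3)
```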
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 77
   Unobserved stochastic nodes: 7
   Total graph size: 1096

Initializing model

Potential scale reduction factors:

     Point est. Upper C.I.
b[1]          1       1.00
b[2]          1       1.00
b[3]          1       1.01
b[4]          1       1.01
b[5]          1       1.00
b[6]          1       1.00
int           1       1.00

Multivariate psrf

1
              b[1]          b[2]        b[3]         b[4]         b[5]
Lag 0   1.00000000  1.0000000000  1.00000000  1.000000000  1.000000000
Lag 1   0.83782954  0.2987078057  0.89775903  0.763252490  0.805402070
Lag 5   0.43071069 -0.0146172396  0.58029004  0.348910306  0.392830115
Lag 10  0.20108565  0.0007040613  0.33031998  0.183319130  0.174396225
Lag 50 -0.00152204  0.0121613179 -0.02271751 -0.007778267 -0.006338512
              b[6]          int
Lag 0   1.00000000  1.000000000
Lag 1   0.48781767  0.290085186
Lag 5   0.05229748  0.023234647
Lag 10  0.01188647  0.022378266
Lag 50 -0.01617526 -0.007485411

     b[1]      b[2]      b[3]      b[4]      b[5]      b[6]       int 
1323.5634 8200.7273  808.1154 1451.3082 1424.3403 4959.7263 7338.8099 

Let’s look at the results.


Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

        Mean     SD Naive SE Time-series SE
b[1]  1.6751 0.7726 0.006308       0.021266
b[2] -0.1415 0.2860 0.002335       0.003162
b[3] -0.2821 0.8290 0.006769       0.029195
b[4] -0.7649 0.5204 0.004249       0.013801
b[5] -0.6308 0.6090 0.004972       0.016122
b[6]  1.6153 0.4956 0.004046       0.007048
int  -0.1812 0.3069 0.002506       0.003591

2. Quantiles for each variable:

        2.5%     25%     50%      75%  97.5%
b[1]  0.3449  1.1314  1.6010  2.14544 3.3454
b[2] -0.7333 -0.3247 -0.1250  0.04586 0.3996
b[3] -2.1163 -0.7330 -0.1984  0.16667 1.3038
b[4] -1.8606 -1.0966 -0.7443 -0.39796 0.1629
b[5] -1.9975 -1.0051 -0.5667 -0.17784 0.3808
b[6]  0.7305  1.2720  1.5880  1.92534 2.6926
int  -0.7723 -0.3896 -0.1875  0.02098 0.4270

[1] "gravity" "ph"      "osmo"    "cond"    "urea"    "calc"   

It is clear that the coefficients for variables gravity, cond (conductivity), and calc (calcium concentration) are not 0. The posterior distribution for the coefficient of osmo (osmolarity) looks like the prior, and is almost centered on 0 still, so we’ll conclude that osmo is not a strong predictor of calcium oxalate crystals. The same goes for ph.

urea (urea concentration) appears to be a borderline case. However, if we refer back to our correlations among the variables, we see that urea is highly correlated with gravity, so we opt to remove it.

Our second model looks like this:
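A sketch of the reduced model (this time with ordinary normal priors on the coefficients):

```r
mod2_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:3) {
        b[j] ~ dnorm(0.0, 1.0/25.0)  # normal priors with variance 25
    }
} "
```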

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 77
   Unobserved stochastic nodes: 4
   Total graph size: 644

Initializing model

Potential scale reduction factors:

     Point est. Upper C.I.
b[1]          1       1.00
b[2]          1       1.01
b[3]          1       1.00
int           1       1.00

Multivariate psrf

1
             b[1]         b[2]         b[3]         int
Lag 0  1.00000000  1.000000000  1.000000000 1.000000000
Lag 1  0.58793996  0.659269202  0.518618026 0.268846239
Lag 5  0.11264262  0.172292876  0.052563130 0.011156841
Lag 10 0.01394350  0.028131378  0.027109868 0.008667875
Lag 50 0.02214263 -0.004663427 -0.009634286 0.004534375

    b[1]     b[2]     b[3]      int 
3454.705 2532.593 4482.459 8152.918 

Results

Mean deviance:  68.64 
penalty 5.499 
Penalized deviance: 74.14 
Mean deviance:  71.14 
penalty 3.999 
Penalized deviance: 75.14 

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

        Mean     SD Naive SE Time-series SE
b[1]  1.4203 0.5045 0.004119       0.008616
b[2] -1.3510 0.4539 0.003706       0.009065
b[3]  1.8782 0.5512 0.004500       0.008251
int  -0.1526 0.3216 0.002626       0.003563

2. Quantiles for each variable:

        2.5%     25%     50%      75%   97.5%
b[1]  0.4903  1.0675  1.3948  1.74094  2.4670
b[2] -2.2732 -1.6524 -1.3403 -1.03696 -0.4860
b[3]  0.9040  1.4874  1.8533  2.22382  3.0601
int  -0.7868 -0.3641 -0.1543  0.05851  0.4914
          lower      upper
b[1]  0.4901620  2.4667547
b[2] -2.2363226 -0.4572467
b[3]  0.8602976  3.0022585
int  -0.7868230  0.4915983
attr(,"Probability")
[1] 0.95

[1] "gravity" "cond"    "calc"   

The DIC is actually better for the first model. Note that we did change the prior between models, and generally we should not use the DIC to choose between priors. Hence comparing DIC between these two models may not be a fair comparison. Nevertheless, they both yield essentially the same conclusions. Higher values of gravity and calc (calcium concentration) are associated with higher probabilities of calcium oxalate crystals, while higher values of cond (conductivity) are associated with lower probabilities of calcium oxalate crystals. There are more modeling options in this scenario, perhaps including transformations of variables, different priors, and interactions between the predictors, but we’ll leave it to you to see if you can improve the model.

Prediction from a logistic regression model

How do we turn model parameter estimates into model predictions? The key is the form of the model. Remember that the likelihood is Bernoulli, so \(y\) is 1 with probability \(p\). We modeled the logit of \(p\) as a linear function of the covariates, which we showed in the first segment of this lesson leads to an exponential form for \(E(y) = p\).

Take the output from our model in the last segment. We will use the posterior means as point estimates of the parameters.

      b[1]       b[2]       b[3]        int 
 1.4202939 -1.3510461  1.8782490 -0.1526008 

The posterior mean of the intercept was about \(-0.15\). Since we centered and scaled all of the covariates, a value of 0 for each \(x\) corresponds to its average value. Therefore, under our last model, the point estimate for the probability of calcium oxalate crystals when gravity, cond, and calc are at their average values is \(1/(1+e^{-(-0.15)}) \approx 0.463\).

Now suppose we want to make a prediction for a new specimen whose value of gravity is average, whose value of cond is one standard deviation below the mean, and whose value of calc is one standard deviation above the mean. Our point estimate for the probability of calcium oxalate crystals is \(1/(1+e^{-(-0.15 + 1.4 \cdot 0.0 - 1.3 \cdot (-1.0) + 1.9 \cdot 1.0)}) \approx 0.955\).

If we want to make predictions in terms of the original \(x\) variable values, we have two options:

  1. For each x variable, subtract the mean and divide by the standard deviation for that variable in the original data set used to fit the model.

  2. Re-fit the model without centering and scaling the covariates.

Predictive checks

We can use the same ideas to make predictions for each of the original data points. This is similar to what we did to calculate residuals with earlier models.

First we take the X matrix and matrix multiply it with the posterior means of the coefficients. Then we need to pass these linear values through the inverse of the link function as we did above.
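A sketch (pm_coef is our name for the vector of posterior means shown earlier, ordered b[1], b[2], b[3], int; X is the standardized design matrix):

```r
pm_Xb = pm_coef["int"] + X[ , c("gravity", "cond", "calc")] %*% pm_coef[1:3]
phat = 1.0 / (1.0 + exp(-pm_Xb))  # inverse of the logit link
head(phat)
```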

        [,1]
2 0.49717423
3 0.10793911
4 0.22085399
5 0.10628912
6 0.27321322
7 0.09079615

These phat values are the model’s predicted probability of calcium oxalate crystals for each data point. We can get a rough idea of how successful the model is by plotting these predicted values against the actual outcome.

Suppose we choose a cutoff for these predicted probabilities. If the model tells us the probability is higher than 0.5, we will classify the observation as a 1 and if it is less than 0.5, we will classify it as a 0. That way the model classifies each data point. Now we can tabulate these classifications against the truth to see how well the model predicts the original data.
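For a 0.5 cutoff, the table and correct classification rate can be computed like this (a sketch):

```r
(tab0.5 = table(phat > 0.5, dat$r))  # rows: classification, columns: truth
sum(diag(tab0.5)) / sum(tab0.5)      # correct classification rate
```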

       
         0  1
  FALSE 38 12
  TRUE   6 21
[1] 0.7662338

The correct classification rate is about 76%, not too bad, but not great.

Now suppose that it is considered really bad to predict no calcium oxalate crystal when there in fact is one. We might then choose to lower our threshold for classifying data points as 1s. Say we change it to 0.3. That is, if the model says the probability is greater than 0.3, we will classify it as having a calcium oxalate crystal.

       
         0  1
  FALSE 32  7
  TRUE  12 26
[1] 0.7532468

It looks like we gave up a little classification accuracy, but we did indeed increase our chances of detecting a true positive.

We could repeat this exercise for many thresholds between 0 and 1, and each time calculate our error rates. This is equivalent to calculating what is called the ROC (receiver-operating characteristic) curve, which is often used to evaluate classification techniques.

These classification tables we have calculated were all in-sample. They were predicting for the same data used to fit the model. We could get a less biased assessment of how well our model performs if we calculated these tables for data that were not used to fit the model. For example, before fitting the model, you could withhold a set of randomly selected “test” data points, and use the model fit to the rest of the “training” data to make predictions on your “test” set.

Poisson regression

Data

For an example of Poisson regression, we’ll use the badhealth data set from the COUNT package in R.
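Loading the data might look like this (a sketch; badhealth ships with the COUNT package):

```r
library("COUNT")
data("badhealth")
head(badhealth)
any(is.na(badhealth))  # check for missing values
```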

  numvisit badh age
1       30    0  58
2       20    0  54
3       16    0  44
4       20    0  57
5       15    0  33
6       15    0  28
[1] FALSE

As usual, let’s visualize these data.
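For example (a sketch; jitter helps separate overlapping integer-valued points):

```r
plot(jitter(badhealth$age), jitter(badhealth$numvisit),
     col = ifelse(badhealth$badh == 1, "red", "black"),
     xlab = "age", ylab = "number of visits")
```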

Model

It appears that both age and bad health are related to the number of doctor visits. We should include model terms for both variables. If we believe the age/visits relationship is different between healthy and non-healthy populations, we should also include an interaction term. We will fit the full model here and leave it to you to compare it with the simpler additive model.
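A sketch of the full model in JAGS (the prior precisions here are our assumptions):

```r
library("rjags")

mod_string = " model {
    for (i in 1:length(numvisit)) {
        numvisit[i] ~ dpois(lam[i])
        log(lam[i]) = int + b_badh*badh[i] + b_age*age[i] + b_intx*age[i]*badh[i]
    }
    int ~ dnorm(0.0, 1.0/1e6)
    b_badh ~ dnorm(0.0, 1.0/1e4)
    b_age ~ dnorm(0.0, 1.0/1e4)
    b_intx ~ dnorm(0.0, 1.0/1e4)
} "
```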

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3673

Initializing model

Potential scale reduction factors:

       Point est. Upper C.I.
b_age        1.05       1.17
b_badh       1.07       1.21
b_intx       1.07       1.23
int          1.05       1.16

Multivariate psrf

1.07
           b_age    b_badh    b_intx       int
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000
Lag 1  0.9568339 0.9648059 0.9670415 0.9529518
Lag 5  0.8289827 0.8669756 0.8727088 0.8248339
Lag 10 0.7026633 0.7648550 0.7714431 0.6972311
Lag 50 0.2181864 0.2220852 0.2361552 0.2120618

   b_age   b_badh   b_intx      int 
262.2733 198.5700 192.1092 267.0209 

Model checking

To get a general idea of the model’s performance, we can look at predicted values and residuals as usual. Don’t forget that we must apply the inverse of the link function to get predictions for \(\lambda\).
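A sketch of those computations (mod_csim is the matrix of combined posterior samples; pm_coef is our name for the vector of posterior means):

```r
X = as.matrix(badhealth[ , -1])            # badh and age
X = cbind(X, with(badhealth, badh * age))  # append the interaction column
head(X)

(pm_coef = colMeans(mod_csim))  # posterior means of the coefficients

llam_hat = pm_coef["int"] + X %*% pm_coef[c("b_badh", "b_age", "b_intx")]
lam_hat = exp(llam_hat)  # apply the inverse of the log link
resid = badhealth$numvisit - lam_hat
plot(lam_hat, resid)     # residuals against predicted values
```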

     badh age  
[1,]    0  58 0
[2,]    0  54 0
[3,]    0  44 0
[4,]    0  57 0
[5,]    0  33 0
[6,]    0  28 0
       b_age       b_badh       b_intx          int 
 0.008308285  1.553631324 -0.010526445  0.354702836 

It is not surprising that the variability increases for values predicted at higher values since the mean is also the variance in the Poisson distribution. However, observations predicted to have about two visits should have variance about two, and observations predicted to have about six visits should have variance about six.
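One rough check, sketched here, is to compare the residual variance within the two health groups, whose predicted means differ:

```r
var(resid[which(badhealth$badh == 0)])  # healthy group
var(resid[which(badhealth$badh == 1)])  # bad-health group
```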

[1] 7.022633
[1] 41.1963

Clearly this is not the case with these data. This indicates that either the model fits poorly (meaning the covariates don’t explain enough of the variability in the data), or the data are “overdispersed” for the Poisson likelihood we have chosen. This is a common issue with count data. If the data are more variable than the Poisson likelihood would suggest, a good alternative is the negative binomial distribution, which we will not pursue here.

Results

Assuming the model fit is adequate, we can interpret the results.


Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean       SD  Naive SE Time-series SE
b_age   0.008283 0.002048 1.672e-05      0.0001312
b_badh  1.551187 0.186064 1.519e-03      0.0133989
b_intx -0.010526 0.004297 3.508e-05      0.0003122
int     0.355250 0.080018 6.533e-04      0.0051004

2. Quantiles for each variable:

            2.5%       25%       50%       75%     97.5%
b_age   0.004221  0.006972  0.008308  0.009644  0.012218
b_badh  1.177831  1.432008  1.553631  1.674402  1.912097
b_intx -0.018892 -0.013424 -0.010526 -0.007719 -0.001754
int     0.202702  0.302248  0.354703  0.406303  0.517165

The intercept is not necessarily interpretable here because it corresponds to a healthy 0-year-old, whereas the youngest person in the data set is 20 years old.

For healthy individuals, it appears that age has a positive association with number of doctor visits. Clearly, bad health is associated with an increase in expected number of visits. The interaction coefficient is interpreted as an adjustment to the age coefficient for people in bad health. Hence, for people with bad health, age is essentially unassociated with number of visits.

Predictive distributions

Let’s say we have two people aged 35, one in good health and the other in poor health. What is the posterior probability that the individual with poor health will have more doctor visits? This goes beyond the posterior probabilities we have calculated comparing expected responses in previous lessons. Here we will create Monte Carlo samples for the responses themselves. This is done by taking the Monte Carlo samples of the model parameters, and for each of those, drawing a sample from the likelihood. Let’s walk through this.

First, we need the \(x\) values for each individual. We’ll say the healthy one is Person 1 and the unhealthy one is Person 2. Their \(x\) values are:
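In the covariate order (badh, age, age*badh), a sketch:

```r
x1 = c(0, 35, 0)   # Person 1: good health, age 35, interaction 0
x2 = c(1, 35, 35)  # Person 2: poor health, age 35, interaction 35
```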

The posterior samples of the model parameters are stored in mod_csim:

Markov Chain Monte Carlo (MCMC) output:
Start = 1 
End = 7 
Thinning interval = 1 
          b_age   b_badh       b_intx       int
[1,] 0.01251371 1.522677 -0.010775965 0.1742109
[2,] 0.01227413 1.499695 -0.010834899 0.2338410
[3,] 0.01168103 1.490651 -0.008354329 0.2261534
[4,] 0.01139581 1.406510 -0.009096313 0.2555852
[5,] 0.01144687 1.395884 -0.008626509 0.2807728
[6,] 0.01085805 1.398194 -0.009371433 0.2690865
[7,] 0.01028249 1.455677 -0.010255476 0.3066875

First, we’ll compute the linear part of the predictor:
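A sketch, multiplying each posterior sample of the coefficients by the covariate vectors:

```r
loglam1 = mod_csim[ , "int"] + mod_csim[ , c("b_badh", "b_age", "b_intx")] %*% x1
loglam2 = mod_csim[ , "int"] + mod_csim[ , c("b_badh", "b_age", "b_intx")] %*% x2
```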

Next we’ll apply the inverse link:
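Since the link is the log, its inverse is the exponential:

```r
lam1 = exp(loglam1)
lam2 = exp(loglam2)
```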

The final step is to use these samples for the \(\lambda\) parameter for each individual and simulate actual number of doctor visits using the likelihood:
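One Poisson draw per posterior sample (a sketch):

```r
(n_sim = length(lam1))  # number of posterior samples
y1 = rpois(n = n_sim, lambda = lam1)
y2 = rpois(n = n_sim, lambda = lam2)
```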

[1] 15000

Finally, we can answer the original question: What is the probability that the person with poor health will have more doctor visits than the person with good health?
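We simply compare the simulated responses (a sketch):

```r
mean(y2 > y1)  # Monte Carlo estimate of the posterior predictive probability
```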

[1] 0.9189333

Because we used our posterior samples for the model parameters in our simulation, this posterior predictive distribution on the number of visits for these two new individuals naturally takes into account our uncertainty in the model estimates. This is a more honest/realistic distribution than we would get if we had fixed the model parameters at their MLE or posterior means and simulated data for the new individuals.

Multi-level models

Hierarchical models

Data

Let’s fit our hierarchical model for counts of chocolate chips. The data can be found in

First 10 values:

   chips location
1     12        1
2     12        1
3      6        1
4     13        1
5     12        1
6     12        1
7      9        1
8     10        1
9      7        1
10    14        1

Number of cookies per location:

 1  2  3  4  5 
30 30 30 30 30 

We can also visualize the distribution of chips by location.
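For example (a sketch, assuming the data frame is named dat):

```r
boxplot(chips ~ location, data = dat)
```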

Prior predictive checks

Before implementing the model, we need to select prior distributions for \(\alpha\) and \(\beta\), the hyperparameters governing the gamma distribution for the \(\lambda\) parameters. First, think about what the \(\lambda\)’s represent. For location j, \(\lambda_j\) is the expected number of chocolate chips per cookie. Hence, \(\alpha\) and \(\beta\) control the distribution of these means between locations. The mean of this gamma distribution will represent the overall mean of number of chips for all cookies. The variance of this gamma distribution controls the variability between locations. If this is high, the mean number of chips will vary widely from location to location. If it is small, the mean number of chips will be nearly the same from location to location.

To see the effects of different priors on the distribution of \(\lambda\)’s, we can simulate. Suppose we try independent exponential priors for \(\alpha\) and \(\beta\).
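A sketch of such a simulation (the exponential rates below are illustrative guesses to be tuned, not recommendations):

```r
set.seed(112)
n_sim = 500
alpha_pri = rexp(n_sim, rate = 1.0/2.0)  # prior draws for alpha
beta_pri = rexp(n_sim, rate = 5.0)       # prior draws for beta
mu_pri = alpha_pri / beta_pri            # implied mean of the gamma distribution
sig_pri = sqrt(alpha_pri / beta_pri^2)   # implied standard deviation
summary(mu_pri)
summary(sig_pri)
```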

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.021    2.983    9.852   61.127   29.980 4858.786 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1834    3.3663    8.5488   41.8137   22.2219 2865.6461 

After simulating from the priors for \(\alpha\) and \(\beta\), we can use those samples to simulate further down the hierarchy:
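That is, draw location means from the gamma distribution implied by each (alpha, beta) pair:

```r
lam_pri = rgamma(n = n_sim, shape = alpha_pri, rate = beta_pri)
summary(lam_pri)
```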

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.000     1.171     7.667    83.062    28.621 11005.331 

Or for a prior predictive reconstruction of the original data set:
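Keeping five lambda draws, one per location, and simulating 30 cookies at each (a sketch):

```r
(lam_pri = lam_pri[1:5])
(y_pri = rpois(n = 150, lambda = rep(lam_pri, each = 30)))
```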

[1] 66.444084  9.946688  6.028319 15.922568 47.978587
  [1] 63 58 64 63 70 62 61 48 71 73 70 77 66 60 72 77 69 62 66 71 49 80 66
 [24] 75 74 55 62 90 65 57 12  9  7 10 12 10 11  7 14 13  9  6  6 13  7 10
 [47] 12  9  9 10  7  8  6  9  7 10 13 13  8 12  6 10  3  6  7  4  6  7  5
 [70]  5  4  3  6  2  8  4  8  4  5  7  1  4  5  3  8  8  3  1  7  3 16 14
 [93] 13 17 17 12 13 13 16 16 15 14 11 10 13 17 16 19 16 17 15 16  7 17 21
[116] 16 12 15 14 13 52 44 51 46 39 40 40 44 46 59 45 49 58 42 31 52 43 47
[139] 53 41 48 57 35 60 51 58 36 34 41 59

Because these priors have high variance and are somewhat noninformative, they produce unrealistic predictive distributions. Still, enough data would overwhelm the prior, resulting in useful posterior distributions. Alternatively, we could tweak and simulate from these prior distributions until they adequately represent our prior beliefs. Yet another approach would be to re-parameterize the gamma prior, which we’ll demonstrate as we fit the model.

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 150
   Unobserved stochastic nodes: 7
   Total graph size: 322

Initializing model

Potential scale reduction factors:

       Point est. Upper C.I.
lam[1]          1          1
lam[2]          1          1
lam[3]          1          1
lam[4]          1          1
lam[5]          1          1
mu              1          1
sig             1          1

Multivariate psrf

1
             lam[1]       lam[2]       lam[3]      lam[4]        lam[5]
Lag 0   1.000000000  1.000000000  1.000000000  1.00000000  1.0000000000
Lag 1   0.030355236  0.114891517  0.012019002  0.02420960  0.0550134305
Lag 5  -0.011033347 -0.001825634  0.006302119 -0.00340224 -0.0003906281
Lag 10  0.014915445  0.002371482  0.005600358 -0.01068207  0.0023366539
Lag 50 -0.006870905  0.002557641 -0.017849438 -0.01373833 -0.0010541139
                 mu         sig
Lag 0   1.000000000 1.000000000
Lag 1   0.377884708 0.573072968
Lag 5   0.020871448 0.089990578
Lag 10 -0.004499889 0.007465335
Lag 50  0.016514603 0.017038862

   lam[1]    lam[2]    lam[3]    lam[4]    lam[5]        mu       sig 
14097.507 11079.695 15000.000 14529.931 12590.876  6261.560  3782.793 

Model checking

After assessing convergence, we can check the fit via residuals. With a hierarchical model, there are now two levels of residuals: the observation level and the location-mean level. To simplify, we’ll look at the residuals associated with the posterior means of the parameters.

First, we have observation residuals, based on the estimates of location means.

   lam[1]    lam[2]    lam[3]    lam[4]    lam[5]        mu       sig 
 9.281197  6.226731  9.528052  8.949076 11.758216  9.125364  2.088021 

[1] 6.447126
[1] 13.72414
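A sketch of this residual computation (Python, on synthetic stand-in data; `lam_hat` and `mu_hat` echo the posterior means reported above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior means of the location-level lambdas (from the output above).
lam_hat = np.array([9.281, 6.227, 9.528, 8.949, 11.758])
mu_hat = 9.125

# Synthetic stand-in for the data: 30 cookies per location.
location = np.repeat(np.arange(5), 30)
chips = rng.poisson(lam=lam_hat[location])

# Observation-level residuals: each count minus its location's fitted mean.
resid = chips - lam_hat[location]

# Location-level residuals: fitted location means minus the overall mean.
lam_resid = lam_hat - mu_hat
print(resid.mean().round(2), lam_resid.round(3))
```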

We don’t see any obvious violations of our model assumptions.

Results


Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean     SD Naive SE Time-series SE
lam[1]  9.281 0.5298 0.004325       0.004467
lam[2]  6.227 0.4600 0.003756       0.004370
lam[3]  9.528 0.5445 0.004446       0.004446
lam[4]  8.949 0.5260 0.004295       0.004365
lam[5] 11.758 0.6150 0.005021       0.005504
mu      9.125 0.9896 0.008080       0.012812
sig     2.088 0.7089 0.005788       0.011699

2. Quantiles for each variable:

         2.5%    25%    50%    75%  97.5%
lam[1]  8.273  8.918  9.274  9.633 10.357
lam[2]  5.363  5.907  6.212  6.534  7.166
lam[3]  8.491  9.157  9.514  9.888 10.622
lam[4]  7.944  8.591  8.937  9.296 10.003
lam[5] 10.582 11.339 11.746 12.164 12.996
mu      7.257  8.508  9.090  9.705 11.204
sig     1.102  1.590  1.953  2.444  3.816

Random intercept linear model

We can extend the linear model for the Leinhardt data on infant mortality by incorporating the region variable. We’ll do this with a hierarchical model, where each region has its own intercept.
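To make the structure concrete, here is a simulation from a model of this form (Python sketch; every coefficient value below is illustrative, not an estimate from the analysis):

```python
import numpy as np

rng = np.random.default_rng(3)

# Random-intercept model of the kind described:
#   loginfant_i = a[region_i] + b1 * logincome_i + b2 * oil_i + eps_i,
# with region intercepts a[j] ~ Normal(a0, tau^2).
a0, tau, b1, b2, sig = 7.5, 1.0, -0.5, 0.8, 0.4  # illustrative values
n_regions, n = 4, 101

a = rng.normal(a0, tau, size=n_regions)           # one intercept per region
region = rng.integers(0, n_regions, size=n)
logincome = rng.normal(6.5, 1.0, size=n)
oil = rng.integers(0, 2, size=n)

loginfant = a[region] + b1 * logincome + b2 * oil + rng.normal(0, sig, size=n)
```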

'data.frame':   105 obs. of  4 variables:
 $ income: int  3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
 $ infant: num  26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
 $ region: Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
 $ oil   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

          income infant   region oil
Australia   3426   26.7     Asia  no
Austria     3350   23.7   Europe  no
Belgium     3346   17.0   Europe  no
Canada      4751   16.8 Americas  no
Denmark     5029   13.5   Europe  no
Finland     3312   10.1   Europe  no

Previously, we worked with infant mortality and income on the logarithmic scale. Recall also that we had to remove some missing data.
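The original cleaning was done in R; an equivalent sketch in Python/pandas, on a toy stand-in frame, would be:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Leinhardt data (the real data has 105 rows).
dat = pd.DataFrame({
    "income": [3426, 3350, None, 4751],
    "infant": [26.7, 23.7, 17.0, None],
})

# Drop rows with missing values, then work on the log scale.
dat = dat.dropna()
dat["logincome"] = np.log(dat["income"])
dat["loginfant"] = np.log(dat["infant"])

print(len(dat))  # rows remaining after dropping missing data
```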

'data.frame':   101 obs. of  6 variables:
 $ income   : int  3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
 $ infant   : num  26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
 $ region   : Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
 $ oil      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ logincome: num  8.14 8.12 8.12 8.47 8.52 ...
 $ loginfant: num  3.28 3.17 2.83 2.82 2.6 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:4] 24 83 86 91
  .. ..- attr(*, "names")= chr [1:4] "Iran" "Haiti" "Laos" "Nepal"

Now we can fit the proposed model:

  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
 [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   
     1  2  3  4
  0 31 20 24 18
  1  3  2  3  0
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 101
   Unobserved stochastic nodes: 9
   Total graph size: 639

Initializing model

Potential scale reduction factors:

     Point est. Upper C.I.
a[1]          1       1.01
a[2]          1       1.01
a[3]          1       1.01
a[4]          1       1.01
a0            1       1.00
b[1]          1       1.01
b[2]          1       1.00
sig           1       1.00
tau           1       1.00

Multivariate psrf

1
            a[1]      a[2]      a[3]      a[4]         a0      b[1]
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000
Lag 1  0.9129166 0.9173458 0.9125335 0.9313257 0.26347530 0.9784253
Lag 5  0.8354717 0.8419625 0.8390327 0.8547619 0.24745907 0.8966903
Lag 10 0.7448016 0.7507876 0.7480949 0.7600529 0.21619007 0.7985298
Lag 50 0.2961008 0.2981225 0.3033191 0.3046762 0.06845489 0.3158794
              b[2]          sig          tau
Lag 0  1.000000000 1.0000000000  1.000000000
Lag 1  0.136754030 0.0603107165  0.258056648
Lag 5  0.026978792 0.0097383706  0.001325742
Lag 10 0.032334710 0.0194463099  0.019528626
Lag 50 0.007282582 0.0003911277 -0.010478866

      a[1]       a[2]       a[3]       a[4]         a0       b[1] 
  185.1849   186.3214   172.3905   183.7944   846.4287   163.5538 
      b[2]        sig        tau 
 5881.4342 12831.2890  8977.0617 

Meta analysis

Prior Sensitivity Analysis

When communicating results from any analysis, a responsible statistician will report and justify modeling decisions, especially assumptions. In a Bayesian analysis, one additional assumption is open to scrutiny: the choice of prior distributions. For the models considered so far in this course, there were infinitely many prior distributions we could have chosen from.

How do you justify the priors you choose? If they truly represent your beliefs about the parameters before the analysis, and the model is appropriate, then the posterior distribution truly represents your updated beliefs. If you don’t have any strong beliefs beforehand, there are often default, reference, or noninformative prior options, and you will have to select one. However, a collaborator or a boss (indeed, somebody somewhere) may not agree with your choice of prior. One way to increase the credibility of your results is to repeat the analysis under a variety of priors and report how the results differ. This process is called prior sensitivity analysis.

At a minimum you should always report your choice of model and prior. If you include a sensitivity analysis, select one or more alternative priors and describe how the results of the analysis change. If they are sensitive to the choice of prior, you will likely have to explain both sets of results, or at least explain why you favor one prior over another. If the results are not sensitive to the choice of prior, this is evidence that the data are strongly driving the results. It suggests that different investigators coming from different backgrounds should come to the same conclusions.

If the purpose of your analysis is to establish a hypothesis, it is often prudent to include a “skeptical” prior which does not favor the hypothesis. Then, if the posterior distribution still favors the hypothesis despite the unfavorable prior, you will be able to say that the data substantially favor the hypothesis. This is the approach we will take in the following example, continued from the previous lesson.

Example

Let’s return to the example of number of doctor visits (Poisson regression). We concluded from our previous analysis of these data that both bad health and increased age are associated with more visits. Suppose the burden of proof that bad health is actually associated with more visits rests with us, and we need to convince a skeptic.

First, let’s re-run the original analysis and remind ourselves of the posterior distribution for the badh (bad health) indicator.

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3673

Initializing model

Essentially all of the posterior probability mass is above 0, suggesting that this coefficient is positive (and consequently that bad health is associated with more visits). We obtained this result using a relatively noninformative prior. What if we use a prior that strongly favors values near 0? Let’s repeat the analysis with a normal prior on the badh coefficient that has mean 0 and standard deviation 0.2, so that the prior probability that the coefficient is less than 0.6 is greater than 0.998. We’ll also use a small variance on the prior for the interaction term involving badh (standard deviation 0.01, because this coefficient is on a much smaller scale).
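The quoted prior probability can be verified directly: under a Normal(0, 0.2\(^2\)) prior, the probability that the coefficient is less than 0.6 is \(\Phi(0.6/0.2) = \Phi(3) \approx 0.9987\). A quick check:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Skeptical prior: Normal(mean=0, sd=0.2).
p = norm_cdf(0.6 / 0.2)  # P(coefficient < 0.6)
print(round(p, 4))       # 0.9987, i.e. greater than 0.998
```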

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3679

Initializing model

How did the posterior distribution for the coefficient of badh change?

Under the skeptical prior, the posterior distribution for b_badh has shifted markedly downward, now concentrated between about 0.6 and 1.1. Although the strong prior influenced our inference on the magnitude of the bad health effect on visits, it did not change the fact that the coefficient is clearly above 0. In other words, even under the skeptical prior, bad health is associated with more visits, with posterior probability near 1.

We should also check the effect of our skeptical prior on the interaction term involving both age and health.

[1] 0.9486

The result here is interesting. Our estimate for the interaction coefficient has gone from negative under the noninformative prior to positive under the skeptical prior, so this result is sensitive to the choice of prior. In this case, because the skeptical prior shrinks away much of the bad health main effect, it is likely that the interaction effect attempts to restore some of the positive effect of bad health on visits. Thus, despite some observed prior sensitivity, our conclusion that bad health is positively associated with more visits remains unchanged.
